3
Question:
Cluster one dataset using two different algorithms. Choose a public dataset with more than 10 features. Perform Exploratory Data Analysis (EDA), feature engineering, and any pre-processing you consider necessary. Explain the results of each task. (LO1, LO3, LO4, 17 points)
Answer
In [ ]:
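import pandas as pd

# Opt in to the future pandas behavior that disables silent downcasting
pd.set_option('future.no_silent_downcasting', True)

# Load the dataset from the working directory
dt = pd.read_csv('./data.csv')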
In [ ]:
# Describe data shape
print("Data Shape")
print(dt.shape)
print("--------------")
# Describe overall data
print("Data Info")
# info() prints its report directly and returns None, so it is not wrapped in print()
dt.info(memory_usage=False)
print()  # end the info() output line before the separator
print("--------------")
print("Data Description")
print(dt.describe())
print("--------------")
Data Shape
(207, 41)
--------------
Data Info
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 207 entries, 0 to 206
Data columns (total 41 columns):
 #   Column                 Non-Null Count  Dtype
---  ------                 --------------  -----
 0   Unnamed: 0             207 non-null    int64
 1   ID                     207 non-null    int64
 2   Country of Origin      207 non-null    object
 3   Farm Name              205 non-null    object
 4   Lot Number             206 non-null    object
 5   Mill                   204 non-null    object
 6   ICO Number             75 non-null     object
 7   Company                207 non-null    object
 8   Altitude               206 non-null    object
 9   Region                 205 non-null    object
 10  Producer               206 non-null    object
 11  Number of Bags         207 non-null    int64
 12  Bag Weight             207 non-null    object
 13  In-Country Partner     207 non-null    object
 14  Harvest Year           207 non-null    object
 15  Grading Date           207 non-null    object
 16  Owner                  207 non-null    object
 17  Variety                201 non-null    object
 18  Status                 207 non-null    object
 19  Processing Method      202 non-null    object
 20  Aroma                  207 non-null    float64
 21  Flavor                 207 non-null    float64
 22  Aftertaste             207 non-null    float64
 23  Acidity                207 non-null    float64
 24  Body                   207 non-null    float64
 25  Balance                207 non-null    float64
 26  Uniformity             207 non-null    float64
 27  Clean Cup              207 non-null    float64
 28  Sweetness              207 non-null    float64
 29  Overall                207 non-null    float64
 30  Defects                207 non-null    float64
 31  Total Cup Points       207 non-null    float64
 32  Moisture Percentage    207 non-null    float64
 33  Category One Defects   207 non-null    int64
 34  Quakers                207 non-null    int64
 35  Color                  207 non-null    object
 36  Category Two Defects   207 non-null    int64
 37  Expiration             207 non-null    object
 38  Certification Body     207 non-null    object
 39  Certification Address  207 non-null    object
 40  Certification Contact  207 non-null    object
dtypes: float64(13), int64(6), object(22)
--------------
Data Description
       Unnamed: 0          ID  Number of Bags       Aroma      Flavor  \
count  207.000000  207.000000      207.000000  207.000000  207.000000
mean   103.000000  103.000000      155.449275    7.721063    7.744734
std     59.899917   59.899917      244.484868    0.287626    0.279613
min      0.000000    0.000000        1.000000    6.500000    6.750000
25%     51.500000   51.500000        1.000000    7.580000    7.580000
50%    103.000000  103.000000       14.000000    7.670000    7.750000
75%    154.500000  154.500000      275.000000    7.920000    7.920000
max    206.000000  206.000000     2240.000000    8.580000    8.500000

       Aftertaste    Acidity        Body     Balance  Uniformity  Clean Cup  \
count  207.000000  207.00000  207.000000  207.000000  207.000000      207.0
mean     7.599758    7.69029    7.640918    7.644058    9.990338       10.0
std      0.275911    0.25951    0.233499    0.256299    0.103306        0.0
min      6.670000    6.83000    6.830000    6.670000    8.670000       10.0
25%      7.420000    7.50000    7.500000    7.500000   10.000000       10.0
50%      7.580000    7.67000    7.670000    7.670000   10.000000       10.0
75%      7.750000    7.87500    7.750000    7.790000   10.000000       10.0
max      8.420000    8.58000    8.250000    8.420000   10.000000       10.0

       Sweetness     Overall  Defects  Total Cup Points  Moisture Percentage  \
count      207.0  207.000000    207.0        207.000000           207.000000
mean        10.0    7.676812      0.0         83.706570            10.735266
std          0.0    0.306359      0.0          1.730417             1.247468
min         10.0    6.670000      0.0         78.000000             0.000000
25%         10.0    7.500000      0.0         82.580000            10.100000
50%         10.0    7.670000      0.0         83.750000            10.800000
75%         10.0    7.920000      0.0         84.830000            11.500000
max         10.0    8.580000      0.0         89.330000            13.500000

       Category One Defects     Quakers  Category Two Defects
count            207.000000  207.000000            207.000000
mean               0.135266    0.690821              2.251208
std                0.592070    1.686918              2.950183
min                0.000000    0.000000              0.000000
25%                0.000000    0.000000              0.000000
50%                0.000000    0.000000              1.000000
75%                0.000000    1.000000              3.000000
max                5.000000   12.000000             16.000000
--------------
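The summary above already hints at two data-quality issues: some columns have missing values (for example, ICO Number has only 75 non-null entries out of 207), and some numeric columns have a standard deviation of 0 (Clean Cup, Sweetness, Defects), so they cannot help separate clusters. The following is a minimal sketch of how such columns could be listed programmatically. The helper name `summarize_quality` and the small `demo` frame are hypothetical stand-ins for illustration, not part of the original notebook; on the real data one would pass `dt` instead.

```python
import pandas as pd


def summarize_quality(df: pd.DataFrame):
    """Return (missing counts for columns with gaps, numeric columns with one unique value)."""
    # Count missing values per column and keep only columns that have any
    missing = df.isna().sum()
    missing = missing[missing > 0].sort_values(ascending=False)

    # A numeric column with at most one distinct value carries no information
    numeric = df.select_dtypes(include="number")
    constant = [col for col in numeric.columns if numeric[col].nunique(dropna=True) <= 1]
    return missing, constant


# Hypothetical stand-in data; replace `demo` with `dt` to inspect the real dataset
demo = pd.DataFrame({
    "Aroma": [7.50, 7.67, 8.00],
    "Sweetness": [10.0, 10.0, 10.0],
    "ICO Number": [None, "x", None],
})

missing, constant = summarize_quality(demo)
print("Columns with missing values:")
print(missing)
print("Constant numeric columns:", constant)
```

Columns reported as constant are natural candidates to drop before clustering, while columns with many missing values need either imputation or removal; which option is appropriate depends on the analysis that follows.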
In [ ]:
from ydata_profiling import ProfileReport
ProfileReport(dt, title="Profiling Report")